Setup

rm(list=ls())
knitr::opts_knit$set(root.dir = '~/OneDrive - University College London/Projects/Ostridge_PanAf/baypass/analysing_baypass_output/baypass_aux/') 

Parameters

subsps=c('all', 'ce', 'w')
env_data='f_over_sum_known_trees'
habitat_col='grey'
fprs=c(0.005, 0.001, 0.0005)
chr21=TRUE
flanks=1000

Here I extract candidates from running the BayPass AUX model with habitat classifications from Aleman et al. (2020) (using 4 categories) for all subspecies datasets.

Environmental data input files were made with /Users/harrisonostridge/OneDrive - University College London/Projects/PanAf/environmental_data/scripts/1.extract_environmental_and_behavioural_data.v3.Rmd and /Users/harrisonostridge/OneDrive - University College London/Projects/PanAf/environmental_data/scripts/2.format_env_file.Rmd.

BayPass was run on myriad with scripts in myriad:analysis/phase1and2_exome_analysis/baypass/scripts/mapped.on.target/. BayPass was run three times with different seeds.

BayPass output files were formatted with /Users/harrisonostridge/OneDrive - University College London/Projects/PanAf/phase1and2_exomes/baypass/baypass_aux/scripts/format_output/format_baypass_aux_output.ipynb where SNP coordinates, gene annotations and results from multiple independent runs were added. Statistics with no suffix represent results from the focal run (e.g. ‘M_Beta’), and statistics from the the two repeat runs have the suffixes ‘.r1’ and ‘.r2’ (e.g. ‘M_Beta.r1’ and ‘M_Beta.r2’). Median values have also been calculated. I also add the results from random runs used to generate nulls.

Library

library(data.table)
options(datatable.fread.datatable=FALSE)
library(dplyr)
library(ggplot2)
library(ggVennDiagram)
library(psych)
library(tidyverse)
library(cowplot)
library(reshape)
library(ggpattern)
# Call scripts containing custom functions 
source("../scripts/baypass_tools.R")
source("../baypass_core/scripts/candidate_allele_frequency_patterns.R")
source("scripts/baypass_aux_tools.R")

# In
env_dir="../../../environmental_data/output/"
baypass_core_out_dir="../baypass_core/output/formatted_baypass_core_output/"
formatted_aux_out_dir="output/formatted_baypass_aux_output/"
ac_in="../../../allele_frequencies/"
# Out
formatted_aux_out_fprs_dir="output/baypass_aux_output_with_fprs/"
gowinda_snp_file_dir="../../../gowinda/baypass_aux/output/gowinda_input/snp_files/"

Read in data

## - all 
## Read chr21
## - ce 
## Read chr21
## - w 
## Read chr21

Candidate SNPs

Beta - effect size and direction

  • An absolute beta value of more than 0.2 was considered “strongly associated” in the BayPass paper.

Bayes Factor - measure of significance

  • Jeffreys’ rule (Jeffreys 1961); >10 strong evidence, >15 - very strong evidence, >20 - decisive evidence.

Use Median BFs to Select Candidates

##                     Var1  Freq Subspecies  Tail_pct
## 1 f_over_sum_known_trees  2515        all 0.4827116
## 2 f_over_sum_known_trees 28898         ce 9.1758908
## 3 f_over_sum_known_trees   270          w 0.1540533

Generating null distributions

Selecting SNPs using commonly used BF thresholds such as 20 results in far to many candidates (many 1000s). Instead, here I select thresholds based on estimated FPR using an estimated null distribution. Nulls are generated wither from randomly shuffling environmental data or running baypass with chr21.

Randomly shuffle environmnetal data

Here I select thresholds based on estimated FPR I estimate FPR by running BayPass with habitat classifications which are randomly shuffled within subspecies. I run BayPass on with 5 different randomly shuffled habitat classifications.

In order to account for run-to-run variation, I use the median values over three independent runs with different seeds (in depth exploration of this in /Users/harrisonostridge/OneDrive - University College London/Projects/PanAf/phase1and2_exomes/baypass/baypass_aux/scripts/all/virishti_habitats.all.baypass_aux_output.run_to_run_var.Rmd) as recommended in the BayPass manual. Below is a summary of all the runs.

Number of Focal Runs (Using True Habitat Assignments)

(# subspecies datasets) × (# independent repeats)

3 × 3 = 9

Number of Random Runs (Using Habitat Assignments Randomly Shuffled Within Subspecies)

(# subspecies datasets) × (# random habitat datasets) × (# independent repeats)

3 × 5 × 3 = 45

Total Number of Runs

9 + 45 = 54

This value increases if I need to separate the dataset into subsets. It requires a lot of computational time. This method also cannot be comparable to the genetics-only test under the core model as there is no equivelent.

non-genic-chr21 null

Unlike the exomes, non-coding regions are expected top be evolving relatively neutrally. Running BayPass with the same environmental data but with this ‘more neutral’ genetic data should give an appropriate null distribution.

This is the method we chose

M_P

M_P is the posterior mean of the parameter P corresponding to the overall proportion of SNPs associated with each given covariable.

If M_P of the real data is greater than the random nulls then we will likely see an excess of candidates compared to the null below.

Null volcano plots

Each plot represents the median values (across three independent runs with different seeds) from running BayPass on a particular subspecies dataset..

## Null is chr21
## Warning in max(betai[[y]]): no non-missing arguments to max; returning -Inf
## Warning in min(betai[[y]]): no non-missing arguments to min; returning Inf
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning in max(betai[[y]]): no non-missing arguments to max; returning -Inf
## Warning in min(betai[[y]]): no non-missing arguments to min; returning Inf
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis

## Warning in max(betai[[y]]): no non-missing arguments to max; returning -Inf
## Warning in min(betai[[y]]): no non-missing arguments to min; returning Inf
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 row containing missing values (`geom_step()`).
## Warning: Transformation introduced infinite values in continuous y-axis

Null vs real data

## Null is chr21
## all

## ce

## w

The excess of positive beta values is striking in ‘all subspecies’.

There is an excess of positive betas for ‘all subspecies’ and central-eastern but not in western.

Select candidates

Calculate FPR per coverage bin

We found that candidates were more likely to have lower coverages than expected from the background. We therefore decided to calculate FPRs within candidate bins to account for this.

FPRs are estimated using a null distribution. SNPs in the null distribution are assigned empirical p values (BF rank/total number of SNPs). The null data is then combined with the exome data and ordered by BF. The FPR for each of the exome SNP is given as the smallest empirical p value of any null SNP with a lower or equal BF.

Example (winthin a single coverage bin):
Null BFs:   0   1   1   4   8
  Ranks:    5   4   4   2   1
  p:        1   0.8 0.8 0.4 0.2
  
Exome BFs:  0   2   3   9   10
  FDR:      1   0.8 0.8 0.2 0.2

We can see that in tied instances we use the lowest rank and that the minimum FPR possible is determined by the number of null SNPs.

This is done for each coverage bin.

## Null is chr21
## Null is chr21
## Null is chr21
## [1] 83938
## [1] 81573

Threshold stats

## Coverage Bin Stats

## Number of Candidates

## Coverage Bin Stats

## Number of Candidates

## Coverage Bin Stats

## Number of Candidates

Match ancestral allele frequency

n_bins=5
betai_bins=list()
betai_chr21_bins=list()
stat='mean_DAF'
# Note that in the BayPass output, M_P means estimated ancestral allele frequency or estimated population allele frequency depending on the file 

for(subsp in subsps){
  if(stat=='mean_DAF'){
    # Calculate mean population allele frequency
    for(data in c('exome', 'chr21')){
      rownames(ac[[subsp]][[data]])=1:nrow(ac[[subsp]][[data]])
      # Calculate  DAF
      ## From allele counts file
      dac=ac[[subsp]][[data]][, grepl(".dac$", names(ac[[subsp]][[data]]))]
      aac=ac[[subsp]][[data]][, grepl(".aac$", names(ac[[subsp]][[data]]))]
      daf=dac/(dac+aac)
      daf[daf=='NaN']=NA
      xtx[[data]][[subsp]]$mean_DAF=rowMeans(daf, na.rm = TRUE)
      ## Alternative way based on BayPass standradised allele frequencies
      #mean_DAF.exome=as.data.table(allele.freq[[subsp]])[, .(mean_DAF = mean(M_P)), by = MRK]
      #xtx[['exome']][[subsp]]=merge(xtx[['exome']][[subsp]], mean_DAF.exome, by='MRK')
    }
  }
  max=max(max(xtx[['exome']][[subsp]][[stat]]), max(xtx[['chr21']][[subsp]][[stat]]))
  # Bins of consistent width
  #breaks=seq(0, max, max/n_bins)
  # Bins of consistent number of SNPs
  breaks=quantile(xtx[['exome']][[subsp]][[stat]], probs = seq(0, 1, length.out = n_bins + 1))
  midpoints=(breaks[-1] + breaks[-length(breaks)]) / 2
  plot=ggplot(NULL)+
    theme_bw()+
    theme(plot.title = element_text(hjust = 0.5))+
    geom_density(aes(x=xtx[['exome']][[subsp]][[stat]], col='Exome')) +
    geom_density(aes(x=xtx[['chr21']][[subsp]][[stat]], col='Non-genic-chr21')) +
    geom_vline(xintercept = breaks, color = "black", linetype = "dashed") +
    labs(title=subsp_names[[subsp]], x=stat) +
    geom_text(aes(x=midpoints, y=0, label=1:n_bins))
  print(plot)
  
  for(data in c('exome', 'chr21')){
    # Bins of consitent width
    #xtx[[data]][[subsp]][[stat]]_bin=cut(xtx[[data]][[subsp]][[stat]], breaks = seq(0, max, max/n_bins), include.lowest=T)
    # Bins of consistent number of SNPs
    xtx[[data]][[subsp]]$stat_bin=cut(xtx[[data]][[subsp]][[stat]], breaks = breaks, include.lowest = TRUE)
    #bins=unique(xtx[[data]][[subsp]]$stat_bin)
    bins=levels(xtx[[data]][[subsp]]$stat_bin)
    bin_names=paste0("bin", 1:length(bins))
    names(bin_names)=bins
    # Numbers for bin names
    xtx[[data]][[subsp]]$stat_bin=bin_names[xtx[[data]][[subsp]]$stat_bin]
  }
  bins=unique(xtx[['exome']][[subsp]]$stat_bin)
  bins=bins[order(bins)]
  for(bin in bins){
    # Exome
    mrks=xtx[['exome']][[subsp]][xtx[['exome']][[subsp]]$stat_bin==bin, 'MRK']
    betai_bins[[bin]][[subsp]]=betai[[subsp]][betai[[subsp]]$MRK %in% mrks,]
    # chr21
    mrks=xtx[['chr21']][[subsp]][xtx[['chr21']][[subsp]]$stat_bin==bin, 'MRK']
    betai_chr21_bins[[bin]][[subsp]]=betai_chr21[[subsp]][betai_chr21[[subsp]]$MRK %in% mrks,]
  }
}

for(bin in paste0("bin", 1:n_bins)){
  cat(bin, "\n")
  plot_fpr_stats(betai_fpr_bins[[bin]], null=betai_chr21_bins[[bin]], fprs=fprs, top.cov=0.025, N_SNPs=TRUE, BF_thresh=TRUE, coverage=TRUE)
}

LD thinning

I thin the exome and non-genic-chr21.

tmp=list()
betai_fpr.prune=list()
betai.prune=list()
betai_chr21.prune=list()
for(i in 2:11){
  for(subsp in subsps){
    #betai_fpr.prune[[paste0(i)]][[subsp]]=betai_fpr[[subsp]][seq(i, nrow(betai_fpr[[subsp]]), by = i),]
    betai.prune[[paste0(i)]][[subsp]]=betai[[subsp]][seq(i, nrow(betai[[subsp]]), by = i),]
    betai_chr21.prune[[paste0(i)]][[subsp]]=betai_chr21[[subsp]][seq(i, nrow(betai_chr21[[subsp]]), by = i),]
    
    tmp=assign_fpr(betai=betai.prune[[paste0(i)]][subsp], null=betai_chr21.prune[[paste0(i)]][subsp], n_bins=5, tail_bin_size=0.1, verbose=F)
    betai_fpr.prune[[paste0(i)]][[subsp]]=tmp[[subsp]]
  }
}
for(i in names(betai_fpr.prune)){
  plot_fpr_stats_main.v4(betai_fpr.prune[[i]], betai_chr21.prune[[paste0(i)]], fprs=fprs, thresh_stat='BF(dB).median', title=paste0("Every ", i, " SNPs"), verbose=F, plot_ratio=F)
}

Selecting savannah and forest candidates seperately

This was done to check whether selection in either direction is contributing to the excess of SNPs strongly associated with the exome.

## Coverage Bin Stats
## all 
## ce 
## w 
## Number of Candidates

## forest
## Coverage Bin Stats
## all 
## ce 
## w 
## Number of Candidates

## savannah
## Coverage Bin Stats
## all 
## ce 
## w 
## Number of Candidates

## forest
## Coverage Bin Stats
## all 
## ce 
## w 
## Number of Candidates

## savannah
## Coverage Bin Stats
## all 
## ce 
## w 
## Number of Candidates

for(subsp in subsps){
  cat(subsp, "\n")
  for(fpr in fprs){
    cat(" FPR<", fpr, "\n")
    betai_fpr.tmp=betai_fpr[[subsp]][betai_fpr[[subsp]]$fpr<=fpr, ]
    abs_b=abs(betai_fpr.tmp$`M_Beta.median`)
    for(beta in c(0.05, 0.1, 0.15, 0.2)){
      abs_b.tmp=abs_b[abs_b>beta]
      cat("   N SNPs with abs(beta) >", beta, ": ", length(abs_b.tmp), " (",100*(length(abs_b.tmp)/length(abs_b)),"% )\n")
    }
    cat("\n")
    df.tmp=aggregate(`BF(dB).median` ~ coverage_bin, betai_fpr.tmp, function(x) min(x))
    colnames(df.tmp)=c("Coverage bin", "Minimum BF(dB)")
    print(df.tmp)
  }
}
## all 
##  FPR< 0.005 
##    N SNPs with abs(beta) > 0.05 :  4543  ( 57.81369 % )
##    N SNPs with abs(beta) > 0.1 :  1122  ( 14.27844 % )
##    N SNPs with abs(beta) > 0.15 :  171  ( 2.176126 % )
##    N SNPs with abs(beta) > 0.2 :  4  ( 0.05090354 % )
## 
##   Coverage bin Minimum BF(dB)
## 1            1       17.58898
## 2            2       16.76234
## 3            3       15.13869
## 4            4       14.90257
## 5            5       14.73321
##  FPR< 0.001 
##    N SNPs with abs(beta) > 0.05 :  2483  ( 87.0312 % )
##    N SNPs with abs(beta) > 0.1 :  1059  ( 37.11882 % )
##    N SNPs with abs(beta) > 0.15 :  171  ( 5.993691 % )
##    N SNPs with abs(beta) > 0.2 :  4  ( 0.1402033 % )
## 
##   Coverage bin Minimum BF(dB)
## 1            1       21.21587
## 2            2       19.97372
## 3            3       18.94423
## 4            4       18.66133
## 5            5       18.33974
##  FPR< 5e-04 
##    N SNPs with abs(beta) > 0.05 :  1829  ( 92.56073 % )
##    N SNPs with abs(beta) > 0.1 :  935  ( 47.31781 % )
##    N SNPs with abs(beta) > 0.15 :  171  ( 8.653846 % )
##    N SNPs with abs(beta) > 0.2 :  4  ( 0.2024291 % )
## 
##   Coverage bin Minimum BF(dB)
## 1            1       21.66304
## 2            2       21.00371
## 3            3       20.16485
## 4            4       20.04321
## 5            5       19.50428
## ce 
##  FPR< 0.005 
##    N SNPs with abs(beta) > 0.05 :  4744  ( 99.95786 % )
##    N SNPs with abs(beta) > 0.1 :  4621  ( 97.3662 % )
##    N SNPs with abs(beta) > 0.15 :  3210  ( 67.6359 % )
##    N SNPs with abs(beta) > 0.2 :  590  ( 12.43152 % )
## 
##   Coverage bin Minimum BF(dB)
## 1            1       25.81588
## 2            2       25.65826
## 3            3       25.47842
## 4            4       24.79733
## 5            5       24.96238
##  FPR< 0.001 
##    N SNPs with abs(beta) > 0.05 :  892  ( 100 % )
##    N SNPs with abs(beta) > 0.1 :  890  ( 99.77578 % )
##    N SNPs with abs(beta) > 0.15 :  796  ( 89.23767 % )
##    N SNPs with abs(beta) > 0.2 :  411  ( 46.07623 % )
## 
##   Coverage bin Minimum BF(dB)
## 1            1       30.27637
## 2            2       31.05851
## 3            3       30.50466
## 4            4       30.11147
## 5            5       31.90612
##  FPR< 5e-04 
##    N SNPs with abs(beta) > 0.05 :  462  ( 100 % )
##    N SNPs with abs(beta) > 0.1 :  461  ( 99.78355 % )
##    N SNPs with abs(beta) > 0.15 :  428  ( 92.64069 % )
##    N SNPs with abs(beta) > 0.2 :  258  ( 55.84416 % )
## 
##   Coverage bin Minimum BF(dB)
## 1            1       32.83620
## 2            2       32.83620
## 3            3       33.43079
## 4            4       31.90612
## 5            5       37.32474
## w 
##  FPR< 0.005 
##    N SNPs with abs(beta) > 0.05 :  373  ( 97.64398 % )
##    N SNPs with abs(beta) > 0.1 :  213  ( 55.75916 % )
##    N SNPs with abs(beta) > 0.15 :  68  ( 17.80105 % )
##    N SNPs with abs(beta) > 0.2 :  6  ( 1.570681 % )
## 
##   Coverage bin Minimum BF(dB)
## 1            1       19.90424
## 2            2       19.50428
## 3            3       18.08659
## 4            4       18.97943
## 5            5       17.13364
##  FPR< 0.001 
##    N SNPs with abs(beta) > 0.05 :  76  ( 98.7013 % )
##    N SNPs with abs(beta) > 0.1 :  72  ( 93.50649 % )
##    N SNPs with abs(beta) > 0.15 :  43  ( 55.84416 % )
##    N SNPs with abs(beta) > 0.2 :  5  ( 6.493506 % )
## 
##   Coverage bin Minimum BF(dB)
## 1            1       30.74346
## 2            2       25.20398
## 3            3       22.77906
## 4            4       24.03658
## 5            5       21.69918
##  FPR< 5e-04 
##    N SNPs with abs(beta) > 0.05 :  28  ( 100 % )
##    N SNPs with abs(beta) > 0.1 :  28  ( 100 % )
##    N SNPs with abs(beta) > 0.15 :  23  ( 82.14286 % )
##    N SNPs with abs(beta) > 0.2 :  4  ( 14.28571 % )
## 
##   Coverage bin Minimum BF(dB)
## 1            2       26.93564
## 2            3       27.19034
## 3            5       29.03633

Split betai test

Because there appears to be a bias towards positive values in central-Eastern in the null and exome I though maybe we need to calculate FPR values separately for sites with positive and negative beta values to account for this.

## Null is chr21
## Null is chr21
## Null is chr21

This has basically no effect at all. I expect that the slightly smaller number of candidates in the split beta version is just because we can be less precise about estimated FPRs due to fewer null SNPs and we always round down. This suggests to me that perhaps large betas do not necessarily correspond to large BFs? This has convinced me that the beta split is not the best option.

Candidate volcano plots

There are a number of candidates with abs(betai) < 0.1 (> 0.1 is considered large effect) at less stringent thresholds in western and particularly in all samples.

Most candidates have abs(betai) > 0.1 for central-eastern, I expect this is because of the effect of Issa Valley.

Candidate Overlap

## Common SNPs only

It is interesting that roughly 50% of western candidates are shared with those in the ‘all samples’ dataset whereas central-eastern shares a much smaller proportion with the ‘all samples’ dataset.

Distributions

As expected, there is basically no association between w and ce candidates but there is with each of those and all - particularly between all and w.

Write files

if(chr21){
  suffix=paste0(".non-genic_",flanks,"bp.flanks")
}else{
  suffix="-cov_cor"
}
for(subsp in subsps){
  if(subsp=='all'){subsp.file=''}else{subsp.file=paste0(".", subsp)}
  if(subsp=='n'){miss.pop=0}else{miss.pop=0.3}
  # Write FPR data
  write.table(betai_fpr[[subsp]], 
              paste0(formatted_aux_out_fprs_dir, "/f5", subsp.file ,".pops_missing.pops.", miss.pop,"_minMAC2_", 
                     env_data, "_summary_betai.out_all_runs.fpr", suffix),
              row.names=F, sep="\t")
  for(fpr in fprs){
    # Write Gowinda input files
    for(cov in unique(betai_fpr[[subsp]]$COVARIABLE_name)){
      ## Both
      write.table(unique(betai_fpr[[subsp]][betai_fpr[[subsp]]$COVARIABLE_name==cov & betai_fpr[[subsp]]$fpr<fpr, c("chr", "pos")]), 
                  paste0(gowinda_snp_file_dir, env_data,"/",subsp,".", env_data,"-",cov,".fpr",100*fpr, "pct" ,suffix),
                  col.names=F, row.names=F, sep="\t")
      ## Positive beta
      write.table(unique(betai_fpr[[subsp]][betai_fpr[[subsp]]$COVARIABLE_name==cov & betai_fpr[[subsp]]$fpr<fpr & betai_fpr[[subsp]]$M_Beta.median>0, c("chr", "pos")]), 
                  paste0(gowinda_snp_file_dir, env_data,"/",subsp,".", env_data,"-",cov,".fpr",100*fpr, "pct" ,suffix, ".pos_beta"),
                  col.names=F, row.names=F, sep="\t")
      ## Negative beta
      write.table(unique(betai_fpr[[subsp]][betai_fpr[[subsp]]$COVARIABLE_name==cov & betai_fpr[[subsp]]$fpr<fpr & betai_fpr[[subsp]]$M_Beta.median<0, c("chr", "pos")]), 
                  paste0(gowinda_snp_file_dir, env_data,"/",subsp,".", env_data,"-",cov,".fpr",100*fpr, "pct" ,suffix, ".neg_beta"),
                  col.names=F, row.names=F, sep="\t")
    }
  }
}